ABSTRACT

A trace-driven simulation study of dynamic load balancing in homogeneous
distributed systems supporting broadcasting is presented. We use information
about job CPU and I/O demands collected from a production system as input to
a simulation model that includes a representative CPU scheduling policy and
considers the message exchange and job transfer costs explicitly. Seven load
balancing algorithms are simulated and their performances compared. We find that
load balancing is capable of significantly reducing the mean and standard
deviation of job response times, especially under heavy system load, and for jobs with
high resource demands. The performances of all hosts, even those originally with
light loads, are generally improved by load balancing. The reduction of the mean
response time increases with the number of hosts, but levels off at around 30
hosts. Algorithms based on periodic or non-periodic load information exchange
provide similar performance, and, among the periodic policies, the algorithms that
use a distinguished agent to collect and distribute load information cut down the
overhead and scale better. They are also the most appropriate algorithms for
adaptive load balancing, which has the potential of offering near-optimal
performance under a wide spectrum of system configurations and load conditions.
System instability in the form of host overloading is possible when the load
information is not up-to-date and the system is under heavy load; however, this
undesirable phenomenon can be alleviated by simple measures. Load balancing is still
very effective even when up to half of the eligible jobs have to be executed
locally. The trace-driven simulation approach to the study of load balancing is
found to be efficient and effective, and is recommended for use before
implementation efforts.
1. Introduction

Distributed computer systems are becoming increasingly available because of the drop in
hardware costs and advances in computer networking technologies. An important advantage of
distributed systems is the potential of resource sharing to provide the users with a rich collection
of resources that are usually unavailable or highly contended for in stand-alone systems.
Examples of sharable resources are files, computing power, and printers. It is frequently observed that,
in a computing environment with a number of hosts connected by networks, there is a high
probability that some of the hosts are heavily loaded, while others are almost idle. Even if the hosts
are evenly loaded over long periods, such as half an hour or more, the instantaneous loads are
likely to be fluctuating constantly.† This suggests that performance gains may be achieved by
transferring jobs from the currently heavily loaded hosts to the lightly loaded ones. This form of
computing power sharing, with the purpose of improving the performance of a distributed system
by redistributing the workload among the available hosts, is commonly called load balancing, or
load sharing.‡

The problem of load balancing has been studied using a number of different approaches over
the years. The early works mainly concentrated on static load balancing [3, 18, 19, 22]. In those
studies, job transfer decisions are made deterministically or probabilistically without taking into
consideration the current state of the system. The problem of program module assignment has
also been studied in a number of forms, with the basic assumption that the program concerned
can be partitioned into a number of modules with known resource consumptions and inter-module
communication costs. Load balancing is formulated as a mathematical programming or network
flow problem, and solved by optimizing some performance index such as the average response
time or the resource utilizations.

Static load balancing is simple and effective when the workload can be sufficiently well
characterized beforehand, but it fails to adjust to the fluctuations in system load. In contrast,
dynamic load balancing* attempts to balance the system load dynamically as jobs arrive.
Because of its generality and ability to respond to temporary system unbalances, dynamic load
balancing has received increasing attention from the research community [7, 8, 9, 12, 16, 17, 21].
Livny and Melman [17] showed, using simple queuing network models and simulation, that
dynamic load balancing can greatly improve average job response time. They also proposed a
number of implementable algorithms for load balancing. Eager et al. [9] carried the work further
by systematically studying a number of dynamic load balancing algorithms with different levels of
complexity. Their results confirmed the great potential of load balancing. They also claimed that
relatively simple algorithms can provide substantial performance improvements, while more
complicated algorithms are not likely to offer much further improvement. Wang and Morris [21]

† Such observations, of course, are dependent on the system and the applications being run. For instance, in a
main-frame batch data processing environment, the loads might be even over long periods of time. In contrast,
however, in a workstation-rich environment, which is becoming more and more popular, the probability of a
majority of the stations being idle or almost idle is very high [30].

‡ The term load balancing has sometimes been used to imply the objective of equalizing the loads of the hosts,
whereas load sharing simply means a redistribution of the workload. We will use the term load balancing in the
rest of this paper, but without the stronger connotation.

* Some authors used the terms adaptive load balancing and dynamic load balancing interchangeably. We
decided, however, to reserve the former for a particular form of load balancing to be described later in this paper.
conducted a comprehensive study and pointed out that the choice of a load balancing algorithm is
a crucial design decision. They also proposed a performance metric called the Q-factor, and used
it to evaluate the quality of the algorithms. Leland and Ott [16] performed an extensive study of
process behavior in the VAX/UNIX environment and evaluated the usefulness of initial process
assignment and process migration as forms of load balancing. A number of other researchers have
also considered process migration in their load balancing algorithms [1, 4]. To limit the scope of
our study to a manageable level, however, we will not consider process migration in this paper.
Process migration is also much more difficult to implement, and involves higher costs in most
systems.

Although different authors make very different assumptions about system structures and
overhead costs, the main tools of study in dynamic load balancing have been queuing network
models and simulation with probabilistic assumptions about job arrivals and resource demands.
Unfortunately, a reasonably accurate analytic model for a real-world system with a load balancing
scheme of modest complexity can be very difficult to construct. Solving the models is even
harder. Consequently, many researchers are forced to make simplifying assumptions that are
often unrealistic, rendering the results of the studies subject to suspicion. For example, in order
to make the model tractable, the job interarrival time and the job execution time are often
assumed to be exponentially distributed. The utilizations of the hosts are sometimes assumed to
be the same, and the effects of the system scale on load balancing performance are often ignored.
For similar reasons, the costs of exchanging load information and other types of costs associated
with load balancing are often ignored or grossly simplified. Simulation models driven by
probability distributions are capable of handling greater system complexity and thus solving a larger class
of problems, but it is still unclear how much error in the results is introduced by the distributional
assumptions made by the investigators.

To substantiate these criticisms, we traced a production VAX/UNIX* system for a number
of extended periods during working hours, and recorded the arrival times of the processes†, as well
as their CPU and disk I/O demands. The distributions of these measurements are shown in
Figures 1, 2 and 3, respectively. It can be seen that none of them follow an exponential pattern.
The inter-arrival time distribution is not very far from exponential, whereas the CPU and I/O
demand distributions are both highly skewed‡. Similar observations have been made by other
researchers  [5,  8,  16] . 
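A quick numerical check of the exponential assumption is the coefficient of variation (standard deviation divided by mean), which equals 1 for an exponential distribution and is much larger for highly skewed data. The sketch below is illustrative only: the function name is ours, and the sample values are made up for the example; they are not taken from the traces used in this study.

```python
import statistics

def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean.

    An exponential distribution has a coefficient of variation of 1,
    so values far above 1 indicate a much heavier tail.
    """
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean

# Hypothetical CPU demands in seconds (made-up values, NOT measured data):
# many tiny jobs, a few medium ones, and one very long job.
cpu_demands = [0.1] * 90 + [5.0] * 9 + [120.0]

cv = coefficient_of_variation(cpu_demands)
print(f"coefficient of variation = {cv:.2f}")
```

A value well above 1 for the CPU or I/O demands would be one sign that an exponential model underestimates the spread of job sizes.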

In this paper, we study the problem of dynamic load balancing using an approach different
from those mentioned above. Job traces collected from a production system are used to drive a
simulation program that implements a number of load balancing algorithms. In this way, we
eliminate the errors caused by assumptions about the workload. The costs of message exchanges
and job transfers are considered so that performance comparisons between the algorithms can be
made on an equal basis. Two broad categories of algorithms are commonly recognized. In source

* UNIX is a trademark of AT&T Bell Laboratories.

† In a UNIX system, a job corresponds to a command line input by a user, and a number of processes may be
created to carry out the job. We will not insist on this distinction in this paper, however.

‡ For a job's I/O demand, both synchronous and asynchronous disk I/O's are considered, while disk cache hits
are properly excluded.




Figure 1. Distribution of job inter-arrival times.

initiative  algorithms,  the  hosts  where  jobs  arrive  take  the  initiative  to  transfer  the  jobs,  whereas 
in  server  initiative  algorithms,  hosts  able  and  willing  to  receive  transferred  jobs  go  out  to  find 
such  jobs.  A  host  may  well  be  a  source  and  a  server  at  the  same  time.  We  concentrate  on  source 
initiative  algorithms  in  this  paper. 

A  load  balancing  algorithm  consists  of  a  number  of  components. 

(1) The information policy specifies the amount of load and job information made available to
job placement decision maker(s), and the way by which the information is distributed. We
may require that the loads of all the hosts in the system be available to the decision
maker(s). Alternatively, no or only partial information may be available. Periodic updates
may be used to distribute load information, or the information may be provided upon
request (demand-polling). A distinguished agent may be involved in the load information
distribution, or no such agent may exist.

(2) The transfer policy determines the eligibility of a job for load balancing based on the job
and the loads of the hosts. It may not be desirable, for example, to transfer small jobs, and
some jobs may require specific resources available only on certain hosts, thus being
unsuitable for consideration.

(3) The placement policy decides, for eligible jobs, the hosts to which the jobs should be
transferred. An attempt may be made to select the least loaded host in the system, or only
an acceptable host is sought so that less load information is needed. If no suitable host can
be found, the jobs will have to be processed locally.

Figure 2. Cumulative distribution of job CPU times (in seconds). Horizontal axis: execution time (power of 2).

The above three component policies of a load balancing algorithm are not isolated from each
other, but interact in various ways. For example, the load information available limits the
possible transfer policies. Because of the large number of options for each component policy, it is
impossible to study all possible policy combinations in this paper. Instead, we shall concentrate
on the information policies and some of the related placement policies, while keeping the other
aspects of the scheme fixed. Specifically, we are interested in comparing the performances of the
algorithms using periodic updates and of those acquiring information on demand. For the periodic
policies, we want to evaluate the performance impact of a global agent that collects and
distributes load information of all the hosts in the system. We also want to study the problem of
instability caused by a number of hosts sending jobs all at the same time to a lightly loaded host,
thus making it overloaded. A number of representative load balancing algorithms are defined and
studied in detail. However, our objective is not to select the best algorithm, but rather, to study
the characteristics of various types of algorithms and the tradeoffs between conflicting
requirements.

The  important  results  from  this  study  include  the  following: 

• A load balancing scheme using any reasonable algorithm can improve the job response times
by 30-60%, and make them much more predictable.

• The mean response times of jobs on every host, even on those originally with light loads, are
reduced by load balancing.

• Periodic and non-periodic information policies provide comparable performance.

• For the periodic information policies, the global algorithms impose less overhead on the
system than the distributed ones (typically half or less for systems with 20 or more hosts), and,
hence, can support larger systems.

• Greater performance improvement can be gained by increasing the system size, but the
improvement levels off beyond a few tens of hosts, at which point it becomes more
advantageous to implement load balancing in clusters.

• Instability may occur when load information is stale and the system load is high, but it can
be alleviated by simple measures.

• Load balancing can still be highly effective when up to half of the jobs that are otherwise
eligible for load balancing must be executed on their local hosts.

Figure 3. Distribution of the number of disk I/O's per job.

Our study also provides insights into the choice of a load balancing algorithm under different
system environments and load conditions.

In the next section, we describe the system we simulate and the structure of the model. We
also discuss the load and performance indices we use. Section 3 describes the algorithms that we
studied in the simulation. The simulation results are presented in Section 4, along with a
discussion and comparison of the algorithms. Some concluding remarks are made in Section 5.
2. Experiment Design

2.1. The Job Trace

A distinguishing feature of our study is the use of job traces instead of probability
distributions to describe the arrival times and resource demands of the jobs. We traced a production
VAX-11/780 system running Berkeley UNIX to collect job traces consisting of tuples of the
format

<job arrival time, CPU time demand, number of disk I/O's>.

Previous measurement studies conducted by the author [23] show that the CPU is the most
contended resource in the type of time-sharing systems from which the job traces are derived. There
is  usually  plenty  of  main  memory,  hence  little  paging  and  almost  no  forced  process  swapping 
occur.  The  networking  subsystem  is  not  heavily  loaded  either.  Therefore,  we  will  consider  only 
CPU  and  disk  I/O  in  our  model,  while  retaining  confidence  in  the  results  of  the  simulation. 
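As a minimal sketch of how such a trace tuple might be represented in a simulation program, the snippet below defines a record type and a parser. The field names, the units (seconds), and the whitespace-separated file layout are assumptions made for illustration; the text specifies only the three components of the tuple.

```python
from typing import NamedTuple

class JobRecord(NamedTuple):
    """One trace tuple: <job arrival time, CPU time demand, number of disk I/O's>."""
    arrival_time: float  # assumed unit: seconds since the start of the session
    cpu_demand: float    # assumed unit: seconds of CPU time
    disk_ios: int        # number of disk I/O operations issued by the job

def parse_trace_line(line: str) -> JobRecord:
    """Parse one whitespace-separated record (hypothetical file layout)."""
    arrival, cpu, ios = line.split()
    return JobRecord(float(arrival), float(cpu), int(ios))

# Example record, invented for illustration only.
rec = parse_trace_line("12.5 0.8 14")
print(rec)
```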

Heterogeneity, either architectural or configurational, complicates the load balancing
problem greatly, and is a deviation from the primary concerns of this research. Therefore, we will
concentrate on homogeneous systems. In fact, to ensure homogeneity and to ease the trace
collection efforts, sessions of job traces were collected on the same host at different times to represent a
number of hosts connected by a network.† The selection of simulation session length is important
because the boundary effects caused by jobs started before the session begins and by those
finishing after the session ends may significantly affect the accuracy of the results. On the other hand,
longer sessions involve greater efforts in trace collection and simulation. We chose the length of
each session to be four hours. Typically, about 6000 processes are created on each host during
this period. Even so, some of the processes executing during a session are not included. Such
processes are mostly system services that are started at system boot time and run until the
system goes down, and a few very long batch jobs. Though small in number, they can represent a
significant portion of CPU time consumption. As a result, the simulated CPU utilizations during
the sessions are lower than in reality, typically by 5-15 percent.

2.2. Model Structure

The simulation model is of event-driven type [11], and its structure is shown in Figure 4.
We adopt a foreground-background round-robin scheduling policy for the CPU. The time
quantum is 100 milliseconds, the same as that used in the Berkeley UNIX system from which the trace
was derived. After a job has accumulated 500 milliseconds of CPU time, it is put into the
background queue, which will be checked only if no job is available in the foreground queue. Since
about 60-65% of the jobs have execution times below this threshold, they will not be sent to the
background queue, thus receiving priority service. While the CPU scheduling policies in computer
systems are usually more complicated, we feel that the above policy captures their essential
features, and may be considered representative. Since the level of contention at the disks is
usually low under normal operating conditions in the type of system we measured [23], we model

† It is recognized that, by so doing, the possible temporal correlations between the loads of the various hosts are
lost.

Figure 4. Structure of system used in simulation.

them as infinite servers causing only processing delays, but no queuing delays. I/O operations are
assumed to be evenly spread throughout the execution of the job,‡ and each disk I/O is assumed
to take 30 milliseconds. A communication network permits message passing and job transfers
between the hosts. Since we are most interested in load balancing in local distributed systems, we
assume that the underlying network supports broadcast (e.g., Ethernet). We also assume the
existence of a distributed file system so that the costs of accessing the program and data files
are roughly the same for all of the hosts. As a result, the files do not have to be moved with the
jobs  to  be  load  balanced.  This  assumption  will  be  increasingly  appropriate  for  future  systems 
designed  for  distributed  computing.  Since  our  trace  data  is  derived  from  a  time-sharing  system 
without  the  support  of  a  distributed  file  system,  we  are  unable  to  simulate  the  contention  at  the 
file  servers,  and  we  also  do  not  have  measurement  data  on  remote  file  accesses.  The  cost  of  30 
milliseconds  for  an  I/O  operation  is  therefore  a  rough  approximation. 
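The foreground-background round-robin policy described above can be sketched in a few lines. This is an illustration under simplifying assumptions of our own, not the simulator used in this study: all jobs are present at time 0, there are no later arrivals, no load balancing overhead, and disk I/O is accounted for separately as a pure delay (consistent with the infinite-server disk model). The function and constant names are ours.

```python
from collections import deque

QUANTUM_MS = 100           # CPU time quantum
FOREGROUND_LIMIT_MS = 500  # accumulated CPU time after which a job is demoted
DISK_IO_MS = 30            # assumed service time of one disk I/O

def fb_round_robin(cpu_demands_ms):
    """Return the completion time (ms) of each job on one CPU.

    Simplifications: every job arrives at time 0 and performs no I/O.
    """
    # Each queue entry is (job id, remaining CPU ms, CPU ms used so far).
    foreground = deque((i, d, 0) for i, d in enumerate(cpu_demands_ms))
    background = deque()
    finish = [0] * len(cpu_demands_ms)
    clock = 0
    while foreground or background:
        # The background queue is served only when the foreground is empty.
        queue = foreground if foreground else background
        job, remaining, used = queue.popleft()
        run = min(QUANTUM_MS, remaining)
        clock += run
        remaining -= run
        used += run
        if remaining == 0:
            finish[job] = clock
        elif used >= FOREGROUND_LIMIT_MS:
            background.append((job, remaining, used))
        else:
            foreground.append((job, remaining, used))
    return finish

def io_delay_ms(disk_ios):
    """Total disk delay for a job: no queuing, a fixed time per I/O."""
    return disk_ios * DISK_IO_MS

print(fb_round_robin([600, 100, 100]))
```

In this example the two short jobs finish within the first few quanta, while the 600 ms job is demoted after using 500 ms of CPU time and completes last.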


‡ Recording the time of the I/O operations during job execution would greatly complicate our trace collection
effort and the model construction and simulation, without providing significant benefit in terms of model
accuracy, since the disks are not the points of contention.

2.3. Cost Assumptions

There are basically two types of overhead costs involved in load balancing. First, current
load indices of the hosts have to be computed and messages exchanged to make them known to
the decision makers. Secondly, placement decisions need to be made and jobs transferred between
the hosts. CPU time and network bandwidth are consumed for these purposes. The latter type of
overhead also directly introduces extra delays for the jobs involved. (So does the former if load
information is acquired while the job to be balanced is waiting, as is the case with a number of
algorithms to be studied.) It has been experimentally observed that, in most current installations, local
area networks, such as the Ethernet, usually have plenty of bandwidth, and the delays in the
network are small compared to the CPU cost of executing the communication protocols [15].
Consequently, we only consider CPU time overhead in this study. We assume that message exchange
and job transfer have preemptive priority over job execution. Based on measurements from our
experimental implementations of load balancing on the VAX/UNIX and SUN/UNIX machines, we
assume that computing the current load and sending it out takes 20 milliseconds of CPU time,
while receiving load information and processing it takes 10 milliseconds. A job transfer is
assumed to take 100 milliseconds of CPU time for both the sending and the receiving host, and
causes 200 milliseconds delay to the job being transferred. This assumption seems to be less
critical than that for message cost because the algorithms we study mainly differ in their information
policies; a change in job transfer cost is likely to change their performances by similar amounts.

It  should  be  pointed  out  that  the  above  cost  assumptions  are  very  approximate;  the  actual 
costs  in  terms  of  the  CPU  times  spent  and  the  job  delays  introduced  are  highly  sensitive  to  the 
load  conditions  of  the  hosts  involved  and  the  network  load.  They  are  also  dependent  on  the 
implementation  of  the  underlying  system,  as  well  as  on  the  size  of  the  message  and  on  that  of  the 
job. 
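To make the arithmetic of these cost assumptions concrete, the sketch below totals the CPU overhead for a given number of messages and job transfers. The function names are ours, and the example round (every host broadcasts its load once and every other host receives it) rests on our reading that one broadcast costs the sender a single 20 ms regardless of the number of receivers; the text does not state this explicitly.

```python
SEND_LOAD_MS = 20        # compute the current load and send it out
RECV_LOAD_MS = 10        # receive one load message and process it
TRANSFER_CPU_MS = 100    # CPU cost of a job transfer, charged to each end
TRANSFER_DELAY_MS = 200  # extra delay experienced by the transferred job

def cpu_overhead_ms(sends, receives, transfers):
    """Total CPU overhead summed over all hosts involved."""
    return (sends * SEND_LOAD_MS
            + receives * RECV_LOAD_MS
            + transfers * 2 * TRANSFER_CPU_MS)  # sender and receiver both pay

def broadcast_round_ms(n_hosts):
    """CPU overhead of one round in which every host broadcasts its load once.

    Assumption: a broadcast is one send for the sender and one receive for
    each of the other n_hosts - 1 hosts.
    """
    return cpu_overhead_ms(sends=n_hosts,
                           receives=n_hosts * (n_hosts - 1),
                           transfers=0)

print(broadcast_round_ms(10))
```

Under these assumptions the receive cost grows with the square of the number of hosts, which is one way to see why the cost of exchanging load information cannot be ignored in larger systems.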


2.4.  Load  and  Performance  Metrics 

In order to compare the performances of various load balancing algorithms, we need a
number of metrics. First, it is important to characterize the load on the whole system, as the
performance of load balancing schemes varies with the system load. We choose the average CPU
utilization of all the hosts over the entire session as the load level indicator since it represents the
level of contention for the most critical resources in the system. We are also interested in a load
index that we can use to predict the response time of a job if that job is executed on a particular
host. Ferrari [11] pointed out, using mean value analysis, that a linear combination of the
resource queue lengths in a computer system can be an excellent predictor of job response time,
with the coefficients being the estimated resource consumptions of the job. In a previous
measurement study [23], we found that the CPU queue length has a high correlation with the job response
time in a CPU-bound host, and hence suggests itself as a good load index.

To  measure  and  compare  the  effectiveness  of  load  balancing  algorithms,  we  need  to  define  a 
performance  index.  We  choose  the  mean  job  response  time  because  decreasing  the  job  response 
time  is  our  primary  objective  of  load  balancing.  However,  this  does  not  measure  the  variability  of 
the  job  response  times.  We  will  use  the  standard  deviation  of  the  response  times  of  all  the  jobs  to 
complement  the  mean  response  time. 
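The metrics above reduce to simple computations, sketched here for illustration. The function names are ours, the linear load index is written only in the general form described in the text (queue lengths weighted by the job's estimated resource demands), and the use of the population standard deviation rather than the sample standard deviation is an assumption.

```python
import statistics

def average_cpu_utilization(busy_seconds_per_host, session_seconds):
    """System load level: mean CPU utilization of all hosts over the session."""
    return statistics.fmean([b / session_seconds for b in busy_seconds_per_host])

def predicted_response_time(queue_lengths, job_demands):
    """Linear load index: resource queue lengths weighted by the job's
    estimated consumption of each resource (same ordering in both lists)."""
    return sum(q * d for q, d in zip(queue_lengths, job_demands))

def response_time_summary(response_times):
    """Performance index: (mean, standard deviation) of job response times."""
    return statistics.fmean(response_times), statistics.pstdev(response_times)

# Invented numbers for illustration: two hosts over a four-hour session.
print(average_cpu_utilization([7200, 3600], 4 * 3600))
print(predicted_response_time([3, 1], [2.0, 0.5]))
print(response_time_summary([1.0, 3.0]))
```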
3. Load Balancing Algorithms

We studied seven algorithms that use different types of information policies and related
placement policies. For ease of comparison, we base the transfer policy of all the algorithms on
the local host load and job execution time thresholds. When the CPU queue length of a host is at
or below a threshold, all jobs arriving there are processed locally. Otherwise, all the jobs arriving
at that host and with execution times above a certain threshold are eligible for load balancing.
Although job execution times are difficult to predict, it is possible to classify the jobs into two
rough categories: “big” jobs which are worth considering for load balancing, and “small” jobs not
to be considered. Moreover, estimation errors can be easily tolerated, as long as they are not too
frequent. Our studies of jobs submitted over 30 days show that such a classification can be made
with a very high success rate simply by looking at the job names. For example, a text processing
job will almost certainly take over 1 second of CPU time, whereas a directory checking operation
is clearly not worth considering for load balancing. One result of this research is that the
performance of the load balancing algorithms is quite robust with regard to the job execution time
threshold  (See  Section  4.4). 
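The transfer policy just described can be sketched as a two-step test: first the local load threshold, then a "big job" classification by name. The set of job names below is hypothetical and serves only as an example of such a classification; the default threshold value is likewise an assumption.

```python
# Hypothetical set of command names classified as "big" jobs. In practice
# this would be derived from observed execution times of past jobs.
BIG_JOB_NAMES = {"troff", "cc"}

def is_eligible(cpu_queue_length, job_name, load_threshold=1):
    """Return True if an arriving job should be considered for load balancing.

    A job is processed locally when the local CPU queue length is at or
    below the threshold, or when the job is classified as "small".
    """
    if cpu_queue_length <= load_threshold:
        return False
    return job_name in BIG_JOB_NAMES

print(is_eligible(3, "troff"), is_eligible(3, "ls"), is_eligible(1, "troff"))
```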

The  following  algorithms  were  studied: 

GLOBAL 

Every T seconds, one of the hosts, designated as the load information center (LIC), receives
load  updates  from  all  the  other  hosts  and  assembles  them  into  a  load  vector,  which  is  then 
broadcast  to  all  the  other  hosts.  If  the  load  of  a  host  is  the  same  as  that  sent  out  the  last 
time,  however,  no  update  needs  to  be  sent  to  the  LIC.  This  applies  to  the  next  algorithm, 
DISTED,  as  well. 

The  placement  policy  of  the  GLOBAL  algorithm,  as  well  as  that  of  the  next  algorithm,  is  as 
follows.  The  local  version  of  the  load  vector  is  searched  for  a  host  with  the  shortest  CPU 
queue  length,  and,  if  the  difference  in  CPU  queue  length  between  the  local  host  and  the 
potential  destination  is  at  or  above  a  given  limit  l  (usually  1  or  2),  the  job  is  sent  there.  If 
there  are  several  hosts  with  the  same  shortest  queue  length,  which  is  often  the  case,  the  first 
one  is  selected.  This  rule,  together  with  a  randomized  starting  point  for  the  search,  can 
potentially  alleviate  the  instability  problem  as  we  will  discuss  later. 

DISTED 

Instead  of  reporting  the  local  load  to  a  centralized  LIC  as  in  GLOBAL,  each  host  broadcasts 
its  load  periodically  for  the  other  hosts  to  update  their  locally  maintained  load  vector. 

CENTRAL 

In the above two algorithms, placement decisions are made by each host using the local
version of the load vector. In the CENTRAL algorithm, there exists a central scheduler for all
the  hosts.  When  a  host  decides  that  a  job  is  eligible  for  load  balancing,  it  sends  a  request  to 
the  central  scheduler,  together  with  the  current  value  of  its  load.  The  central  scheduler 
selects  a  host  with  the  shortest  queue  length  and  informs  the  originating  host  to  send  the  job 
there.  The  load  vector  on  which  the  scheduler  bases  its  decisions  is  updated  using  only  the 
load  information  sent  by  the  hosts  with  the  job  requests. 
CENTEX

The  same  as  CENTRAL  except  that,  periodically,  each  host  sends  its  local  load  to  the  LIC 
(CENTral  with  Exchange).  This  algorithm  can  be  regarded  as  a  hybrid  of  GLOBAL  and 
CENTRAL. 

For the above four algorithms, the load vector used in the placement decision is updated by
increasing the load of the destination host by an adjustable constant (currently 1). All the
algorithms assume that the loads of all the hosts are known to the placement decision makers, with
some delay. The algorithms below use less system state information, and thus have smaller
overhead costs.
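The placement step shared by the load-vector algorithms can be sketched as follows: search the (possibly stale) load vector from a randomized starting point, take the first host with the shortest queue, transfer only if the difference from the local load reaches the limit, and then increase the destination's entry by 1. This is an illustrative sketch, not the simulator's code; the function and parameter names are ours.

```python
import random

def place_with_load_vector(load_vector, local_host, min_difference=1, rng=random):
    """Return the destination host index, or None to process the job locally.

    load_vector is the decision maker's copy of the CPU queue lengths and is
    updated in place when a transfer is decided.
    """
    n = len(load_vector)
    start = rng.randrange(n)  # randomized starting point for the search
    best = None
    for k in range(n):
        host = (start + k) % n
        if host == local_host:
            continue
        # Strict '<' keeps the first host found among equally loaded ones.
        if best is None or load_vector[host] < load_vector[best]:
            best = host
    if best is None or load_vector[local_host] - load_vector[best] < min_difference:
        return None
    load_vector[best] += 1  # account for the job just sent there
    return best

loads = [4, 2, 0, 3]
print(place_with_load_vector(loads, local_host=0), loads)
```

Because different hosts start their search at different points, hosts holding the same stale vector tend to pick different destinations among ties, which is the intuition behind the remark that this rule may alleviate instability.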

RANDOM 

This algorithm uses minimum load information. When a job is found to be eligible for load
balancing, it is sent to a randomly selected host. The receiving host treats the transferred
job exactly as if it had arrived locally. To avoid the undesirable situation in which a job
bounces around indefinitely, we set a limit Lt such that the Lt'th host receiving the job has
to process it no matter what its load is.

THRHLD 

A  number  of  hosts  up  to  a  limit  Lr  are  polled  when  an  eligible  job  arrives,  and  the  job  is 
transferred  to  the  first  host  whose  load  is  below  a  fixed  threshold.  If  no  such  host  is  found, 
the  job  is  processed  locally.  When  the  message  exchange  cost  is  much  lower  than  the  job 
transfer  cost,  this  algorithm  wins  over  RANDOM  by  avoiding  costly  job  transfers. 

LOWEST 

This  is  similar  to  THRHLD  except  that,  instead  of  using  a  threshold  for  the  placement,  a 
fixed  number  of  hosts  are  polled  and  the  most  lightly  loaded  host  is  selected.  Thus,  when 
message  overhead  is  higher,  a  potentially  better  host  may  be  selected  than  by  THRHLD. 

The last three algorithms above are identical to the ones studied by Eager et al. [9]. However,
we use a trace-driven simulation method to evaluate them, and we compare them to those
algorithms that use a load vector. The algorithms above make placement decisions on the basis
of various amounts of system state information. Since we consider the overhead costs of load
balancing explicitly, a direct assessment of the appropriate amount of load information for load
balancing can be made.
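
As a rough illustration of the three policies that do not use a load vector, the sketch below expresses each placement rule as a small function. All names are hypothetical. The `wants_transfer` predicate stands in for the local eligibility test, and the rule in `lowest` that a job stays local unless a polled host is strictly lighter is an assumption of this sketch, not a statement of the simulated algorithm.

```python
import random
from typing import Callable, Sequence, Tuple


def random_hops(start: int, hosts: Sequence[int],
                wants_transfer: Callable[[int], bool],
                transfer_limit: int, rng: random.Random) -> Tuple[int, int]:
    """RANDOM: forward the job to random hosts until one keeps it or the limit is hit."""
    host, hops = start, 0
    while hops < transfer_limit and wants_transfer(host):
        host = rng.choice([h for h in hosts if h != host])
        hops += 1
    # After transfer_limit hops the job must run wherever it is.
    return host, hops


def thrhld(local: int, others: Sequence[int], load_of: Callable[[int], int],
           threshold: int, probe_limit: int, rng: random.Random) -> int:
    """THRHLD: poll up to probe_limit hosts, take the first one below the threshold."""
    for host in rng.sample(list(others), min(probe_limit, len(others))):
        if load_of(host) < threshold:
            return host
    return local  # no suitable host found: process locally


def lowest(local: int, others: Sequence[int], load_of: Callable[[int], int],
           probe_limit: int, rng: random.Random) -> int:
    """LOWEST: poll a fixed number of hosts and take the most lightly loaded one."""
    polled = rng.sample(list(others), min(probe_limit, len(others)))
    best = min(polled, key=load_of, default=local)
    # Assumption of this sketch: transfer only if the best polled host is lighter.
    return best if load_of(best) < load_of(local) else local
```

THRHLD stops polling at the first acceptable host, whereas LOWEST always pays for all of its probes in exchange for a possibly better destination.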

For  comparison,  we  also  implemented  three  boundary  cases  of  load  balancing: 

NoLB 

No  load  balancing  is  attempted;  all  arriving  jobs  are  processed  locally. 

NoCOST 

This  is  the  unrealizable  ideal  case  in  which  the  current  CPU  queue  lengths  of  all  the  hosts 
are known to the transfer decision makers at no cost (in terms of CPU time and job delay),
and  the  transfers  of  jobs  are  also  assumed  to  be  costless. 

PartCOST 

This  is  the  partly-ideal  case  in  which  perfect  load  information  is  assumed  to  be  known  at  no 
cost,  but  job  transfer  costs  are  considered. 

The performance of all the algorithms can be expected to be between those of NoLB and
NoCOST.

There are a large number of potentially useful load balancing algorithms besides the ones
listed above. As we stated earlier, our primary concern in this paper is not the particular
algorithms to use, but rather the effects of different approaches to load information gathering and
placement decision making.

4.  Simulation  Results 

Simulation runs with various system sizes and load levels were executed. To make the
performance comparisons between the algorithms meaningful, a number of simulation runs were
conducted for each algorithm with different adjustable parameter values (e.g., job threshold, load
exchange period), and the best response time was selected. In this way, the comparisons are
between the best achievable performances of different algorithms, and it is hoped that they reveal
the qualities of the algorithms. The results of the simulation experiments are presented in the
following sections.

4.1. Comparison of the Algorithms

Figure 5 shows the average response times of a system of 28 hosts under the load balancing
algorithms described above. Since job traces are used to drive the model, we cannot control the
utilization of the system. However, it is essential to observe the performance of the algorithms
under various load conditions. We achieve this by multiplying the job interarrival times by a
constant factor. By varying the multiplication factor, we are able to generate a number of points for
each algorithm. Although the job stream is altered, the job characteristics (i.e., execution time,
number of I/O) remain the same. We feel that such a modification to the job stream is unlikely
to introduce significant errors in the results.†
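
The load-scaling step can be sketched as follows. The trace is assumed here to be a sorted list of arrival times; the function name and data layout are illustrative only. A factor below 1 compresses the gaps between arrivals and therefore raises the offered load, while the CPU and I/O demands of each job are left untouched.

```python
from typing import List


def scale_interarrivals(arrivals: List[float], factor: float) -> List[float]:
    """Multiply every interarrival gap by `factor`, keeping the first arrival fixed."""
    if not arrivals:
        return []
    scaled = [arrivals[0]]
    for prev, cur in zip(arrivals, arrivals[1:]):
        scaled.append(scaled[-1] + (cur - prev) * factor)
    return scaled


print(scale_interarrivals([0.0, 2.0, 3.0, 7.0], 0.5))
```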

The first observation in Figure 5 is that all the algorithms provide substantial performance
improvements over a wide range of system loads, compared to the NoLB case. In fact, response
times reasonably close to those of the NoCOST case are achievable. The higher the system load,
the greater the improvements. While we observed a greater-than-average improvement in the
mean response time of big jobs (e.g., with execution times greater than 1 second), the mean
response time of the small jobs does not suffer as a result. Figure 5 also demonstrates clearly the
relative performances of the algorithms. The performance of CENTRAL is the worst of the
seven. This is mainly because the global scheduler relies only on the load information provided
with job scheduling requests. It is observed that the frequency of placement decision making for
each host is once per 5-20 seconds when the job threshold is 0.5-1.0 second. At such long
intervals, the loads of the hosts are likely to have changed substantially. Consequently, a high
percentage of the global scheduler's decisions are wrong.

In sharp contrast, the CENTEX algorithm, which is the same as CENTRAL except that
load information is periodically reported to the LIC, provides the best performance among the


† Two other choices are to multiply the job execution time by a factor, and to use different job streams.
However, they both alter the job characteristics and seem to introduce more changes to the workload than the
method we used, thus making the comparison of performances under different workload levels less meaningful.

Figure 5. Average response times under different load levels (28 Hosts).

seven. It has been widely assumed that, in distributed systems, centralized solutions are
undesirable because they tend to create performance bottlenecks and single points of failure. Such a
view, however, may be too simplistic if unqualified. The best solution is environment and problem
dependent. For load balancing, if the interprocessor communication is relatively efficient (as is
the case in this paper), and the system scale is limited (up to 50-100 hosts), the centralized
approach to load information distribution and job placement may be simple and efficient, as
demonstrated by Figures 5 and 6. The cost of job placement is reduced for all the hosts except
the LIC, as they now only need to send local load information and placement requests to the LIC,
rather than maintaining system-wide load information and performing placements themselves.
For the LIC, we observed that up to 35% of its CPU time may be spent on load balancing
functions supporting a system of 40 hosts. Although this is a high overhead for this host, it is a small
price to pay for the whole system. In return, excellent placement decisions based on up-to-date
information are achieved. This explains why the performance of CENTEX is slightly better than
that of THRHLD and LOWEST, which only attempt to select a host from a small subset of the
hosts. For many distributed applications, availability is crucial, hence a centralized solution is not
appropriate. This is not the case with load balancing, however. If the LIC goes down, some other
host can quickly detect the condition and take over its role. The loss of load information is not a
serious problem because load information becomes obsolete in a short while anyway. The brief
interval during which load balancing is unavailable should be easily tolerable because load
balancing is not an essential system service such as the naming service; its absence should in no way
interfere with system operations. In fact, an implementation using essentially the CENTEX
algorithm has been reported to provide effective load balancing [13]. In that environment, inter-host
communication is extremely fast, and the global scheduler is claimed to be able to process 1000
requests per second.

The  comparison  between  the  GLOBAL  and  DISTED  algorithms  is  highly  instructive.  Since 
they  are  the  same,  except  for  their  information  policies,  the  significant  performance  difference 
reveals  the  advantages  of  using  a  global  agent  as  a  relay  point  for  load  information  exchange. 
Assume that there are N hosts in the system, and let the update period be T seconds, and the
cost of sending and receiving a message plus related processing be $M_{send}$ and $M_{recv}$, respectively.
For GLOBAL, the overhead due to the load information exchanges is

$$C_{LIC} = \frac{(N - 1) \times M_{recv} + M_{send}}{T} \times 100\%$$

for the LIC, and

$$C = \frac{M_{send} + M_{recv}}{T} \times 100\%$$

for the other hosts. Except for the LIC, the message overhead is independent of the system size
N. In contrast, for the DISTED algorithm, the message overhead for every host is

$$C = \frac{M_{send} + (N - 1) \times M_{recv}}{T} \times 100\%$$

because, during each interval of duration T, every host has to process the messages broadcast by
every other host.† Compared to GLOBAL, DISTED has neither a central point of failure nor an extra
level of indirection in the distribution of load information, but the overhead is higher
for every host, and grows linearly with the system size. Since the availability considerations are
not important, as discussed above, the GLOBAL algorithm appears more attractive than
DISTED.
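
The overhead comparison can be made concrete with a small calculation. The sketch assumes that, in each period, the LIC in GLOBAL receives one message from each of the other N - 1 hosts and sends one broadcast, that every other GLOBAL host sends one message and receives one, and that every DISTED host sends one broadcast and receives N - 1. The function names and the message costs used in the example are hypothetical and are not measurements from the study.

```python
def global_lic_overhead(n: int, period: float, m_send: float, m_recv: float) -> float:
    """Percent of CPU time the LIC spends on load messages under GLOBAL."""
    return ((n - 1) * m_recv + m_send) / period * 100.0


def global_host_overhead(period: float, m_send: float, m_recv: float) -> float:
    """Percent of CPU time a non-LIC host spends on load messages under GLOBAL."""
    return (m_send + m_recv) / period * 100.0


def disted_host_overhead(n: int, period: float, m_send: float, m_recv: float) -> float:
    """Percent of CPU time every host spends on load messages under DISTED."""
    return (m_send + (n - 1) * m_recv) / period * 100.0


# Hypothetical costs: 5 ms to send and 5 ms to receive, with a 1-second period.
for n in (7, 28, 49):
    print(n,
          global_host_overhead(1.0, 0.005, 0.005),
          disted_host_overhead(n, 1.0, 0.005, 0.005))
```

Under these assumptions the per-host figure for GLOBAL stays constant as N grows, while the DISTED figure grows linearly with N.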

One somewhat surprising result from Figure 5 is that the two drastically different
algorithms, GLOBAL and LOWEST, provide almost identical response times under a wide range of
system loads. The GLOBAL algorithm uses more extensive system state information in an effort
to make optimal transfer decisions. To achieve this, load information is exchanged at a high
frequency, thus incurring high overhead. In the simulation runs, the value of T that provides the
best performance for GLOBAL is between 0.75 and 3 seconds. At such a high frequency, 1-3% of
the CPU time in every host is spent exchanging load information. The LOWEST algorithm does


† Due to the policy of not sending out the local load information if it is the same as last time, the actual
overheads of GLOBAL and DISTED are lower than presented here, typically by 40-70%, but the order analyses here
are still valid.

not attempt to select the globally "best" host for job transfer, but rather only selects the least
loaded among a small group of randomly picked hosts. Although the time it takes to poll the
hosts directly increases the response time of the waiting job, more up-to-date load information is
used for job placement. A main reason GLOBAL is not able to perform better than LOWEST is
that there exists a fundamental contradiction between the need to frequently update the load
vectors at each host and the low utilizations of the load vectors. If the exchange period is 1.0 second,
and the rate at which transfer decisions are made by a host is one job every 10 seconds, then 90%
of the load exchanges are wasted.
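
The 90% figure follows from a simple ratio. As a worked version of the arithmetic, let T be the exchange period and let D denote the mean interval between transfer decisions at a host (D is a symbol introduced here for illustration), and assume that each decision consults only the most recent update and that D is at least T:

```latex
\[
  \text{fraction of exchanges used} \approx \frac{T}{D} = \frac{1.0}{10} = 10\%,
  \qquad
  \text{fraction wasted} \approx 1 - \frac{T}{D} = 90\%.
\]
```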

The above discussion seems to confirm the assertion made by Eager et al. [9] that more
complicated algorithms than THRHLD and LOWEST are not likely to provide substantially
better performance. In fact, the difference between LOWEST and the unrealistic NoCOST
algorithm is already quite small. However, we do not observe the significant improvement from
RANDOM to THRHLD, which was observed by Eager et al. This may be due to the fact that these
authors did not consider the message exchange costs explicitly in their study.

4.2. Effects of System Scale

Scalability  is  an  important  issue  in  load  balancing.  On  the  one  hand,  a  larger  pool  of  hosts 
might  improve  the  performance  of  load  balancing.  On  the  other  hand,  the  overhead  of  load 
balancing  might  grow  with  system  size,  and  the  management  of  the  system  becomes  harder.  The 
average  response  times  of  the  ten  algorithms  in  systems  containing  7,  14,  21,  28,  35,  42,  and  49 
hosts  are  shown  in  Figure  6.  To  make  the  comparisons  meaningful,  the  overall  system  utilizations 
are selected to be within a narrow range (see the host utilization numbers for the NoLB case at
the  top  of  Figure  6),  and  the  response  times  are  normalized  against  that  of  NoLB. 

The negative slopes of the NoCOST and PartCOST curves, as well as those of some of the other
algorithms, suggest the presence of economies of scale. As the number of hosts in the system
increases, the probability of finding a lightly loaded host increases, and the average response time
can be expected to decrease. This is most obvious for the NoCOST case, where the overhead
costs are not considered. For the more realistic algorithms, the overhead may increase with the
system size, making the increase in system size a mixed blessing. Therefore, the scalability of an
algorithm is an important property. On the other hand, it is interesting to observe that, as the
number of hosts increases beyond 28, the response times improve very little. Therefore,
scalability up to a few tens, or at most a few hundreds of hosts, seems sufficient. Beyond that point, it
makes more sense to implement load balancing using several clusters and perform inter-cluster
load balancing using longer-term load information. This observation further enhances the value
of algorithms such as CENTEX and GLOBAL.

Again, in Figure 6, we observe the relative performances of the algorithms. The scalability
of RANDOM, THRHLD and LOWEST is very good. (Their curves are almost parallel to that of
NoCOST.) This is because the number of hosts polled by the algorithms when a placement
decision is made is independent of system size. Comparing GLOBAL and DISTED, we see that the
former scales much better, which may be explained by the overhead analyses in Section 4.1. We
can see two conflicting factors in action by looking at the performance of the DISTED algorithm.
On the one hand, an increase in system size makes it easier to find a host with low load. On the
other hand, the message overhead grows linearly with system size. The composite effect is a

Figure 6. Average response time with different system sizes.

(normalized against the NoLB case)

moderately  rising  curve  for  the  normalized  response  time.  CENTRAL  remains  the  worst  in  all 
cases,  while  CENTEX  demonstrates  very  good  performance  and  satisfactory  scalability. 

4.3.  Effects  of  Load  Balancing  on  Individual  Hosts 

In previous studies of load balancing, it has frequently been assumed that the hosts in
the system are subjected to the same level of load [9, 17, 21]. (The job arrival rates and the
processing rates of all the hosts are the same.) However, this is usually not the case in production
environments. It is very interesting to study the effect of load balancing on the individual hosts,
especially those originally with light loads. At the beginning of this research, we conjectured that,
while load balancing may improve overall system performance and that of the heavily loaded
hosts, the lightly loaded ones may suffer degradations in their performance because additional jobs
are transferred to them. We were, therefore, pleasantly surprised by the results from the
simulations. The average response times of the individual hosts, with and without load balancing, are
shown in Figure 7. As can be observed, the performances of all hosts are generally improved,
with the hosts under heavy loads showing greater improvements. Figure 7 clearly demonstrates
the power of dynamic load balancing: system performance may be greatly improved by taking
advantage of the temporal differences among the hosts' loads, and even hosts with light loads
benefit because congestion on them, though infrequent, can be relieved by other hosts.

Figure 7. Mean response times of individual hosts.

(Utilization without load balancing is 63.3%.)

Another beneficial effect of load balancing revealed by Figure 7 is that it makes the response
times more predictable. For some environments, this is even more important than the reduction
in the mean response time. Figure 8 provides a direct measure of this effect: while the average
response time is cut by a factor of 1.5 to 2.0, its standard deviation is cut by a factor of 2 to 4.
The measurements in Figures 7 and 8 are taken when the system is moderately loaded (the
utilization for the NoLB case is 63.3%). The improvements of the mean and standard deviation of
response time are observed to be more drastic when the system load level is higher.

The term load balancing has in it the implicit meaning of equalizing the loads of the
participating hosts. Though this is not our objective, the equalizing effect of the algorithms studied in
this paper can be clearly seen in Figures 9 and 10. This is more pronounced with GLOBAL and
DISTED than with LOWEST because of the attempt of the former two algorithms at system-wide
optimization. It is interesting to note, in Figures 7 and 10, that the average response time of all
the jobs originating from a host may decrease while the host's utilization increases. The amount
of overhead introduced by a load balancing algorithm can be found in Figure 10 by subtracting

Figure 8. Standard deviations of response times of individual hosts.

(Utilization without load balancing is 63.3%.)

the average CPU utilization with no load balancing from that obtained with a specific algorithm.
As can be expected, the overhead of GLOBAL is higher than that of LOWEST, while both
provide similar performance. This is mainly because of the high frequency of load information
exchanges in GLOBAL. The overhead on the LIC in GLOBAL is proportional to the system scale,
and is higher than that on the other hosts. This is reflected by the significant increase in
utilization of host 1, which we use as the LIC. While this reveals the limitation of the scaling ability of
the algorithm, we do not consider it a serious drawback because our simulation results show that
a single LIC is capable of supporting 50-100 hosts under our cost assumptions. At that point, the
economies of scale have almost no effect, and it is reasonable to implement load balancing in
several clusters.

4.4.  Parameter  Selection  and  Adaptive  Load  Balancing 

Once the load balancing algorithm is decided, the performance is still sensitive to the specific
parameter values adopted. In this section, we assess the degree of such sensitivity. The
adjustable parameters depend on the algorithm. For all of the algorithms, we have a local load
threshold and a job threshold. In addition, for the periodic information policies, we have the load
exchange period, whereas for the non-periodic policies, we have the probe limit. Figure 11 shows
the performance of the GLOBAL algorithm under various parameter combinations. The local
load threshold is set to zero. We can see that the exchange period has a strong influence on the

Figure 9. Average queue lengths of individual hosts
under different load balancing algorithms.

(Utilization without load balancing is 63.3%.)

performance. When the period is too short (e.g., 0.35 second), the overhead is so high that, even
though the load information on which the transfer decisions are based is very up to date, the
performance suffers. On the other hand, if the exchange period is too long (e.g., 10 seconds), the load
information is so out of date that frequent mistakes are made in job placements. (Jobs are sent
to hosts with equal or higher load than the local host.)

In contrast, performance seems to be less sensitive to the job threshold. (The average
response time when only jobs with CPU execution time greater than 1.0 second are considered for
load balancing is close to those when jobs above 0.5 or 2.0 seconds are considered.) This
observation supports our earlier claim that only an approximate separation between big and small jobs is
necessary to achieve good performance. Looking more closely, we again observe a similar pattern
with the job threshold: when the job threshold is too high (e.g., 3 seconds), the full potential of
load balancing is not realized, whereas when it is too low (e.g., 0.25 second), the overhead of job
transfers outweighs the benefit, and performance becomes worse. There is also interaction
between the two parameters: when the exchange period is lengthened, the corresponding optimal
job threshold increases.

It is important to recognize that the combination of parameters that yields the best
performance is highly dependent on the system load level. Table 1 attempts to illustrate this.
Generally speaking, the higher the load, the higher the job threshold and the longer the exchange

Figure 10. Utilizations of individual hosts under different load balancing algorithms.

period  should  be.  For  LOWEST,  an  increase  in  the  probing  limit  may  yield  poorer  performance 
when  the  load  is  high. 

The sensitivity of load balancing performance to the parameter values suggests that some
form of adaptive load balancing may be able to provide good performance when system load
changes widely. Under adaptive load balancing, the system load is constantly monitored, and
changes in algorithms and/or adjustable parameters are made as the load changes so that the
system is always operating at, or close to, the optimal point. Supporting multiple algorithms
involves complicated implementation, and changes in algorithms cannot be made frequently.
Furthermore, for the most promising algorithms, GLOBAL, CENTEX, and LOWEST, the
performance differentials are quite small. Consequently, the gain from switching algorithms is probably
insignificant, and therefore not worth the effort. However, we are not making a general statement
here; for environments different from ours, and for algorithms other than the ones we studied,
using different algorithms under different loads might be quite advantageous.

In contrast to algorithmic change, parameter adjustments are much simpler, and capable of
significantly improving performance when the system load fluctuates widely. Here, we need a
system-wide mechanism that monitors load conditions and makes adjustment decisions. GLOBAL
and CENTEX are the most appropriate for this purpose. The LIC periodically receives load
information from all the hosts, and can use such information to deduce the system state. It can then
send to the hosts the parameter values they should use.
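
One way such a mechanism might look is sketched below. The utilization bands and the parameter values in the table are purely hypothetical placeholders chosen for illustration; they are not the values of Table 1. The sketch only reflects the qualitative trend reported here, namely that a higher load calls for a higher job threshold and a longer exchange period.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

# HYPOTHETICAL upper bounds of the utilization bands used by this sketch.
LEVELS = [0.5, 0.7, 0.85]
# HYPOTHETICAL (job threshold in seconds, exchange period in seconds) per band.
PARAMS = [(0.5, 0.75), (1.0, 1.5), (1.5, 3.0), (2.0, 5.0)]


def choose_parameters(utilizations: Sequence[float]) -> Tuple[float, float]:
    """Map the mean reported utilization to a (job threshold, exchange period) pair."""
    mean_util = sum(utilizations) / len(utilizations)
    return PARAMS[bisect_right(LEVELS, mean_util)]


print(choose_parameters([0.6, 0.6]))
```

In this sketch the LIC would call `choose_parameters` on the load reports it already receives and broadcast the result along with the load vector.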

Figure 11. Effect of Adjustable Parameters on Load Balancing Performance.

(GLOBAL, 14 hosts, utilization without load balancing: 72.3%)

4.5.  Avoidance  of  Instability 

The problem of instability introduced by load balancing is of major concern to the
researchers in this field. It is feared that, because of the delay in load information exchanges, several
hosts may transfer jobs to a once lightly loaded host, and cause it to be overloaded. After the
load information is updated, some other host(s) may become the victim(s). We call this
phenomenon, in which hosts are overloaded in turn, host overloading. Another form of instability is job
thrashing, in which jobs are transferred too many times (or even an indefinite number of
times, as analytically shown in [9] for RANDOM) in an attempt to find the optimal host for job
execution. Host overloading causes performance degradation because of unstable and uneven load
distribution among the hosts, whereas, for job thrashing, degradation is mainly due to excessive
job transfer overhead. Since we are mostly concerned with algorithms that transfer jobs only
once, we will study the host overloading problem here.

We consider a job transfer wrong if the destination host's CPU queue length is equal to or
greater than that of the originating host. There is a distinction between transferring a job
wrongly and collectively overloading a host; the former by itself will only increase the particular
job's response time, whereas the latter will potentially cause system-wide performance
degradation, due to the aggravated effects of the individual wrong transfers. This problem can be serious
because usually the transferred jobs are big. In order to quantitatively measure the level of host
overloading occurring in a system, we define the host overloading factor T to be the percentage of


Table  1.  Optimal  Parameter  Values  under  Different  System  Load  Levels  (28  hosts). 

There are a number of factors that affect the level of host overloading, all having something
to do with the rate at which wrong transfers are made to a host because T is roughly proportional
to this rate. First, the staleness of load information has a deciding effect. The staler the
information, the more jobs are transferred wrongly. Therefore, the non-periodic information
policies that collect load information on demand are less susceptible to host overloading than the
periodic policies. Another important factor is the rate at which jobs that are candidates for
transfer arrive. This depends on the system load level and the job threshold. The higher the load
and the lower the job threshold, the larger the percentage of eligible jobs. To verify our intuitive
argument, we calculated T in simulation experiments for the GLOBAL algorithm using various
load exchange periods and job thresholds. Since it is difficult to consider three factors all
changing at the same time, we fixed the system load level at 79%. Such a system-wide utilization is
high, and host overloading may be expected to be quite serious. The results are shown in Figure
12, and agree with our intuition. It seems that host overloading does not have as disastrous
effects on system performance as we feared: very good performance can be achieved even when
there exists light overloading (T < 10%).
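
The wrong-transfer criterion stated earlier (the destination's CPU queue is at least as long as the originator's) can be turned into a simple measurement. The sketch below computes the percentage of wrong placements, the metric plotted in Figure 12, from a log of queue lengths sampled at transfer time. The function name and the log format are assumptions of this illustration, and the sketch does not attempt to reproduce the host overloading factor itself.

```python
from typing import Iterable, Tuple


def wrong_transfer_percentage(transfers: Iterable[Tuple[int, int]]) -> float:
    """Percentage of transfers whose destination queue was not shorter.

    Each entry is (origin_queue_length, destination_queue_length), both
    sampled at the moment the job is transferred.
    """
    log = list(transfers)
    if not log:
        return 0.0
    wrong = sum(1 for origin, dest in log if dest >= origin)
    return 100.0 * wrong / len(log)


print(wrong_transfer_percentage([(3, 1), (2, 2), (4, 5), (5, 0)]))
```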

Besides load update frequency and job threshold, the system scale also affects host
overloading, but to a lesser degree. It is important to know the number of hosts with the least load. For
the  algorithms  studied  in  this  paper,  placement  decisions  are  based  on  the  instantaneous  CPU 
queue  lengths  of  the  hosts.  Since  there  may  be  more  than  one  host  with  the  same  shortest  queue 
length,  the  transferred  workload  may  be  shared  by  them,  thus  reducing  overloading.  A  larger 


Figure  12.  Percentage  of  wrong  job  placements  for  GLOBAL  under 
various  load  exchange  periods  and  job  thresholds. 

(Number of hosts: 14, Average utilization: 79%)

system size makes such a situation more probable. On the other hand, in a larger system, there are
also more sources of transferred jobs. To quantitatively study the number of hosts with the least
load as a function of system size and load update period, we recorded the load vector at a high
frequency during a simulation experiment for GLOBAL, and counted the number of hosts with the
least number of jobs at their CPUs. The actual shortest queue length is unimportant because we
are only concerned with the relative distribution here. Figure 13 shows the distributions for
systems with 14 and 28 hosts, and the exchange period fixed at 5 seconds. For shorter exchange
periods, the means of the number of hosts with the least load are slightly lower. We find that the
probability of having only one or two hosts with the least load is non-negligible; hence host
overloading can occur, as revealed using another metric in Figure 12. Consider the following case,
which was found to be typical: for a system with 14 hosts and a load level of 80%, the total rate
at which jobs are transferred by the GLOBAL algorithm using a job threshold of 1.0 second is 1-2
jobs/second. This means that, if we update load information every 5 seconds, 5-10 jobs may be
transferred to the single host that used to have the least load. This range is reduced to 1-2 jobs if
the exchange period is 1.0 second, and even lower if the system load is not at such a high level.
Hence, we see that whether host overloading occurs depends primarily on the system load and the
load exchange period.
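
The two quantities used in this argument are easy to compute from a recorded load vector and a measured transfer rate. The helper names below are hypothetical, and the example simply restates the arithmetic of the typical case described above.

```python
from typing import Sequence, Tuple


def least_loaded_count(load_vector: Sequence[int]) -> int:
    """Number of hosts that share the shortest CPU queue length."""
    shortest = min(load_vector)
    return sum(1 for queue in load_vector if queue == shortest)


def transfers_per_period(rate_range: Tuple[float, float],
                         period: float) -> Tuple[float, float]:
    """Jobs transferred system-wide during one exchange period (low, high)."""
    low, high = rate_range
    return low * period, high * period


print(least_loaded_count([2, 0, 1, 0, 3]))
print(transfers_per_period((1, 2), 5))    # exchange period of 5 seconds
print(transfers_per_period((1, 2), 1.0))  # exchange period of 1 second
```

When `least_loaded_count` returns 1, every job transferred during the period is directed at the same host until the next update, which is the situation that leads to host overloading.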

Figure 13. Distribution of the number of hosts with the least load.

Load Expd = 5.0 sec., Job Thrd = 1.0 sec.

4.6. Immobile Jobs

Throughout  our  studies  so  far,  we  have  assumed  that  the  jobs  are  mobile,  that  is,  they  can 
be  executed  on  any  host  in  the  system  with  exactly  the  same  results.  Although  this  assumption 
holds  for  a  large  subset  of  the  jobs,  we  do  observe  that  some  of  the  jobs  are  immobile.  Examples 
include  jobs  that  perform  local  services  and/or  require  local  resources,  such  as  system  daemons, 
login  sessions,  mail  and  message  handling  programs,  and  so  on.  There  are  also  highly  interactive 
jobs, such as command interpreters and editors, for which remote execution will result in poor performance
due to network latencies. Any implementation of load balancing must take the effects
of  these  immobile  jobs  into  consideration.  We  define  the  immobility  factor  to  be  the  percentage 
of  jobs  that  have  to  be  executed  on  the  local  host,  but  are  otherwise  eligible  for  load  balancing. 
By varying the value of the immobility factor, the effect of immobile jobs is revealed. For a system
of 28 hosts with an average CPU utilization of 63.7%, the results are shown in Figure 14.
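
One way to apply an immobility factor in a simulation is sketched below. This is an assumption made for illustration: the text does not state how the immobile jobs were selected, and the function name, the independent random choice per job, and the seed are all hypothetical.

```python
import random


def mark_immobile(num_eligible_jobs, immobility_factor, seed=0):
    """Return one flag per eligible job; True means the job must run locally.

    Assumption: each otherwise eligible job is independently immobile with
    probability immobility_factor (expressed as a fraction, e.g. 0.4).
    """
    if not 0.0 <= immobility_factor <= 1.0:
        raise ValueError("immobility_factor must be in [0, 1]")
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [rng.random() < immobility_factor for _ in range(num_eligible_jobs)]


flags = mark_immobile(10_000, 0.4)
print(sum(flags) / len(flags))  # observed fraction of immobile jobs, close to 0.4
```

Jobs flagged True would bypass the placement decision and be queued at their originating host.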

The  concave  shapes  of  the  curves  are  encouraging,  as  they  indicate  that  effective  load 
balancing is still possible even if a significant proportion of the jobs are immobile. For an immobility
factor of 0.4, the mean response time is only slightly higher than that for the case in which
all  jobs  are  mobile  (the  immobility  factor  being  0).  This  observation  is  not  surprising  because 
load  balancing  is  achieved  by  only  transferring  a  portion  of  the  eligible  jobs.  (Typically,  50-70% 
of  the  eligible  jobs  were  actually  transferred  in  the  simulation  experiments.)  Consequently,  even 
though some of the eligible jobs are immobile, the rest of them can still produce most of the
performance benefits due to the balancing effect.

Figure 14. Effect of immobile jobs.

5. Conclusions

In  this  paper,  we  studied  the  load  balancing  problem  using  simulation  models  driven  by  job 
traces  collected  from  a  production  system.  We  simulated  a  CPU  scheduling  policy  that  is  believed 
to  be  representative,  and  we  considered  explicitly  the  costs  of  load  information  exchange  and  job 
transfers.  Because  of  the  generality  of  the  model  and  the  use  of  live  system  data,  the  results  of 
our  simulation  are  believed  to  be  more  reliable  than  those  from  analytic  models  or  simulations 
driven by probability distributions. On the other hand, our results might be biased towards a particular
type of computing environment.

Seven  load  balancing  algorithms  were  studied,  including  both  ones  using  periodic  information 
policies  and  ones  using  non-periodic  policies.  We  found  that  load  balancing  using  any  reasonable 
algorithm  can  provide  substantial  performance  improvement  over  the  NoLB  case.  Specifically,  the 
average response time of all the jobs may be reduced by 30-60%, and the reduction in its standard
deviation is even more drastic, making the job response times much more predictable than in
the NoLB case. The higher the load, the greater the improvement, and longer jobs benefit more
from load balancing. Looking more closely, we found that the performance of all hosts, even
those originally with light loads, improves under effective load balancing. This is somewhat
counterintuitive, but very encouraging: by cooperating with each other, no one loses. We also
observed a strong tendency of load balancing to equalize the loads of the individual hosts; both
the utilizations and the average CPU queue lengths of the hosts cluster within a small range.

By  varying  the  size  of  the  system,  we  observed  significant  but  limited  economies  of  scale. 
For example, when four systems each with 7 hosts, or two systems each with 14 hosts, are combined
into a single system of 28 hosts, the average response time is significantly reduced for the
algorithms with good scalability. However, beyond 28 hosts, the improvement diminishes quickly.
Consequently, scalability of an algorithm up to 50-100 hosts seems to be sufficient, and clustering
techniques should be used in large-scale systems to avoid the potentially increasing overhead
and the management complexities.

For  the  periodic  load  information  policies,  we  found  that  the  global  approach  has  much  less 
overhead than the distributed approach, and, therefore, performs better and is more scalable. The
periodic and non-periodic policies provide comparable performances under our cost assumptions.
The algorithms that collect load information on demand (RANDOM, THRHLD and LOWEST)
have the advantages of lower message exchange overhead, of being able to scale better, and of
being less susceptible to host overloading. On the other hand, the algorithms that rely on periodic
load exchanges (GLOBAL, CENTEX, and DISTED) have the advantages of being able to potentially
choose the optimal hosts for job transfers, thus offering better performance, and of not subjecting
the jobs eligible for transfer to the delays in getting load information.
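
The contrast between the two families of information policies can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of any of the algorithms named above: the function names, the probing order, and the tie-breaking rule are hypothetical.

```python
import random


def choose_by_probing(loads, local, threshold, probe_limit, rng):
    """On-demand policy: probe up to probe_limit randomly chosen remote hosts
    and return the first whose current load is below threshold, else None."""
    candidates = [h for h in range(len(loads)) if h != local]
    for host in rng.sample(candidates, min(probe_limit, len(candidates))):
        if loads[host] < threshold:
            return host
    return None  # no suitable host found; the job is executed locally


def choose_from_table(table, local):
    """Periodic policy: pick the remote host with the lowest load in the
    (possibly stale) table received at the last exchange."""
    remote = [h for h in range(len(table)) if h != local]
    return min(remote, key=lambda h: table[h])


loads = [4, 0, 3, 5]
print(choose_from_table(loads, local=0))                        # 1
print(choose_by_probing(loads, 0, 1, 3, random.Random(1)))      # 1
```

In the probing variant the job waits while the probes are outstanding, which corresponds to the delay mentioned in the text; in the table variant the decision is immediate but may rest on outdated loads.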

The  performance  of  load  balancing  using  the  algorithms  studied  in  this  paper  is  found  to  be 
quite  sensitive  to  the  values  of  the  local  load  threshold,  load  exchange  period,  and  host  probing 
limit.  The  combinations  of  the  parameter  values  that  provide  optimal  performances  are  in  turn 
dependent on the system load level. Consequently, adaptive load balancing has the promising
potential of maintaining optimal performance under changing system configuration and load. The
GLOBAL  and  CENTEX  algorithms  are  the  most  appropriate  for  this  because  of  the  presence  of 
the  LIC.  More  research  is  called  for  in  this  area. 

Host overloading is not significant for non-periodic algorithms, but may be serious with the
periodic algorithms. The deciding factors are the load update frequency, the system load, the job
threshold, and the system size. By using reasonably up-to-date load information and only
transferring a small percentage of jobs, host overloading can be effectively alleviated. Suboptimal
placement decisions may produce better performance than "optimal" decisions, because overloading
on one or a small number of hosts may thus be avoided. Host overloading is not as serious as
we expected: very good performance can be achieved even when it occurs occasionally.
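
A toy calculation illustrates why a suboptimal choice can beat the "optimal" one when the load table is stale. The sketch is deliberately simplified and rests on assumptions not made in the paper: no job completes during the period, the function name and the round-robin spreading rule are hypothetical, and the load values are invented.

```python
def peak_queue_after_period(stale_loads, transfers, spread):
    """Place `transfers` jobs using only the stale load table and return the
    largest resulting queue length.

    spread=1 always picks the host that looked least loaded at the last
    update; spread=k rotates over the k hosts that looked least loaded.
    """
    loads = list(stale_loads)
    targets = sorted(range(len(loads)), key=lambda h: stale_loads[h])[:spread]
    for i in range(transfers):
        loads[targets[i % spread]] += 1
    return max(loads)


stale = [1, 2, 2, 3, 4, 4, 5]
print(peak_queue_after_period(stale, 8, spread=1))  # 9
print(peak_queue_after_period(stale, 8, spread=4))  # 5
```

With eight transfers in one period, always choosing the apparent minimum drives one host to a queue of 9, whereas spreading over the four apparently lightest hosts leaves the maximum at 5.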

The impact of immobile jobs on load balancing is found to be less serious than the immobility
factor might suggest: most of the performance gains are still retained even when up to 50% of
the jobs are immobile.

We have been very much encouraged by the trace-driven simulation approach taken in this
research; it proves to be capable of handling greater complexities and of providing more credible
performance results than the approaches used before in load balancing research. On the other
hand, we only used data from a particular type of time-sharing environment, and so the generality
of our results is limited. The simplifying assumptions made in this research, though less unrealistic
than those of the previous studies, may also have introduced errors in our results. It would be
very interesting to apply the techniques used in this research to other types of computing environments,
especially server-based workstation environments, and to compare the findings. Such
efforts are currently being planned.

In view of the proliferation of distributed systems, and of the great potential of load balancing
as demonstrated in this research and by other authors, it is highly desirable that load balancing
be made a standard service in future distributed systems to substantially increase the performance
of the system without adding any resources. Unfortunately, only a few implementations
exist, and most of them were done in an ad hoc fashion [2, 13, 14, 20]. Besides the implementation
difficulties involved, the major obstacles are a general lack of understanding of the performance
characteristics of the algorithms proposed and of the engineering tradeoffs involved. Trace-driven
simulation appears to be an appropriate tool for load balancing studies, and should be well
exploited before an implementation effort starts, because the latter is much more costly.